Creating and Validating a Large Image Database for METTREC

Michael  D.  Garris  and  William  W.  Klein 
National  Institute  of  Standards  and  Technology 


ABSTRACT 


The National Institute of Standards and Technology (NIST) is in the process of setting up a new series of
conferences named the Metadata Text Retrieval Conferences (METTREC). The series will focus on evaluating
document conversion using optical character recognition (OCR) and information retrieval (IR)
technologies. Evaluations will be designed to investigate the impact of machine recognition errors upon
information retrieval and to determine what interfaces are appropriate to integrate the two technologies.
To implement this conference, NIST requires databases that can be used for conference evaluations and has
chosen the Federal Register to be the initial document source. It is a large, complete set of documents
containing metadata that will allow quantitative evaluation of recognition and retrieval technologies. This
paper describes the activities associated with scanning the Federal Register and validating the document
images within the database. The process of image validation includes translating filenames, assuring
image integrity, and verifying correct page sequences. In order to reduce the cost of validation, we
minimized human resource expenditure by exploiting OCR and high-speed visual adjudication from
images by an operator. This process minimizes the expensive handling of paper to validate document
image collections.

Keywords:  CD  ROM,  document,  image  database,  information  retrieval,  METTREC,  optical  character 
recognition,  OCR,  quality,  scanning,  technology  evaluation 

1.0  INTRODUCTION 

The Information Technology Laboratory (ITL) at the National Institute of Standards and Technology
(NIST) has conducted extensive research in optical character recognition [1] and text retrieval
technologies [2] [3] [4]. In both areas, a number of conferences have been held to evaluate and understand
the state of the technology [5] [6] [7].

In keeping with prior evaluation methods, the ITL under joint sponsorship with the Department of
Defense is in the process of setting up a new series of conferences named the Metadata Text Retrieval
Conference (METTREC). It will focus on evaluating document conversion using optical character
recognition (OCR) and information retrieval (IR) technologies within the context of integrated tasks.
Evaluations will be designed to investigate the impact of machine recognition technology upon
information retrieval and to determine what interfaces are appropriate to integrate the two technologies.

This effort requires a database that can be used for conference evaluations. The Federal Register for
1994 was chosen to be the source for this database because it is: (1) a complete set of documents within
the public domain; (2) a large collection containing over 250 issues consisting of over 67,000 pages of
information; (3) a structured document set whose hierarchy contains metadata; (4) a collection of pages
containing significant variations in print and image quality; and (5) a set of documents for which the text
for the entire collection is stored within electronic files. Although the latter is not part of this paper, the
text stored within electronic files will allow us to derive the “ground truth”; this represents the correct
ASCII characters that an OCR and IR system should recognize and retrieve and will allow us to quantify
recognition and retrieval accuracy.

To conduct evaluation conferences involving scientific experiments, training and testing materials must
be rigorously prepared. These materials comprise two types of data: (1) document images and (2)
document text and metadata tags. This paper focuses on scanning and validating document images. A
validated image must represent the correct paper page, contain an untruncated bitmap, have reasonable
pixel dimensions, and contain an acceptable range of rotational skew. With a large document collection,
this is a tedious and expensive process. OCR and other automated techniques can be used to minimize
the cost of preparing these materials. This paper documents the process by which we scanned and
validated the Federal Register image database that will be used for METTREC.

2.0  FORMAT  OF  FEDERAL  REGISTER 

The  United  States  Government  Printing  Office  (GPO)  prints  the  Federal  Register  each  work  day  to  record 
the  transactions  of  the  government.  The  Federal  Register  is  the  official  daily  publication  for  Rules, 
Proposed  Rules,  and  Notices  of  Federal  agencies  and  organizations,  as  well  as  Executive  Orders  and  other 
Presidential Documents. Each issue is printed and bound into one or more books. Usually, the GPO publishes one
book per day that is printed in mostly 9 point Vermilion font. During 1994, due to the printing volume,
the GPO printed multiple books on April 25 and November 14. Each book contains three distinct
sections:

• Prefix: Prefix pages consist of three types of pages: a hard cover page, a soft cover page, and content
pages. Figures 1 through 3 provide illustrations of prefix pages. Figure 1 illustrates a hard cover
page that contains the date, the volume number, an address label area, and a postal class
identification; it is printed on high grade kraft paper to minimize or prevent damage incurred by
postal processing and handling. As illustrated in Figure 2, a soft cover page identifies the date, the
volume number, and the page numbering sequence for the body pages contained within the book. A
content page, illustrated in Figure 3, contains a page heading that includes a page number
printed as an upper case Roman numeral. The first content page contains general
information with regard to the Federal Register and its usage. Each subsequent content page
identifies the contents of the Federal Register book for the day. Both soft cover and content pages
are printed on recycled newspaper-quality paper.

• Body: Figures 4-5 illustrate typical body pages contained within a book. A body page provides a
record of the meeting notices, proposals, and transactions of the United States government for the
day. There are two types of body pages: section and detail. A section page, illustrated in
Figure 4, is similar in appearance to a soft cover page and is used to divide the Federal Register into
distinct sections by topic. It contains the name of the issuing agency, the Code of Federal Regulations
(CFR) title and part(s) affected, and a brief description of the specific section subject; this type of
page does not contain a page number field. A detail page, illustrated in Figure 5, elaborates
Presidential and Executive Order(s), Rules and Regulations, Proposed Rules, and Sunshine Act
Meeting Notices. Each detail page contains a page heading that includes a page number
printed as an Arabic number. Both section and detail pages are printed on recycled
newspaper-quality paper.

• Appendix: Figure 6 illustrates a typical appendix page contained within a book. The appendix
consists of pages that provide reader aids which allow a reader of the Federal Register to access
information and to index specific information contained within multiple previously published Federal
Register book(s). An appendix page contains a page heading that includes a page number
printed as a lower case Roman numeral. Appendix pages are printed on recycled
newspaper-quality paper.

With  the  exception  of  cover  and  section  pages,  each  page  of  the  Federal  Register  is  printed  with  a page 
heading  that  includes  a text  banner  printed  above  two  horizontal  lines.  The  text  banner  line  contains 
information  that  identifies  the  document,  the  volume,  the  date,  the  topic,  and  a page  number. 

Each  Federal  Register  book  ends  with  a blank  hard  cover  page  that  minimizes  or  prevents  damage 
incurred  by  postal  processing  and  handling. 

Figure 5. Typical Detail Page.

Figure 6. Typical Appendix Page.


3.0  IMAGE  SCANNING 


Before scanning each Federal Register book, its pages were cut from its binding. This resulted in paper
pages that were approximately 20 cm (8”) by 28 cm (11”) in size. The NIST Federal Register
collection is missing one issue for the 1994 calendar year: March 10. The collection, containing 255
books, was scanned by a contractor at its Rockville, MD facility. Image scanning was performed using a
Kodak 923^ scanner at a resolution of 15.75 pixels per millimeter (400 pixels per inch (PPI)) to output a
compressed bitonal Tagged Image File Format (TIFF™ [8]) file. Image files were written to CD Recordable
cartridges (CDR) using an image file naming convention, “/Nmm/nn/nn/nn/nn.tif”, where:

□ mm denotes the CDR volume where mm = 01..23

□ nn denotes a path or filename where nn = 0..99

This naming convention allows for storage of 100 image files within each sub-directory entity. For
example, the image file “N18/00/00/00/00.tif” contains page number 44606. Page number 44607 is
contained in the next sequentially numbered file. This numbering convention continues with image file
“N18/00/00/01/00.tif” containing page number 44707.

Approximately 67,000 Federal Register pages were scanned and the resultant TIFF™ images were
written to 22 CDR cartridges (CDR cartridge N20 was skipped). Figure 7 illustrates the GPO printing
volume in terms of pages per month.
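
The serial naming convention described above can be unpacked mechanically. The following is a minimal Python sketch, not the tooling used to build the database; the names `CdrImage` and `parse_cdr_path` are hypothetical, and the sketch assumes every path has one two-digit volume number followed by exactly four two-digit levels.

```python
import re
from typing import NamedTuple


class CdrImage(NamedTuple):
    volume: int  # CDR volume number, the "mm" in "Nmm"
    serial: int  # zero-based position of the file within that volume


# Optional leading slash, then Nmm/nn/nn/nn/nn.tif
_CDR_PATH = re.compile(r"^/?N(\d{2})/(\d{2})/(\d{2})/(\d{2})/(\d{2})\.tif$")


def parse_cdr_path(path: str) -> CdrImage:
    """Split a serial CDR path into its volume and its position in the volume."""
    match = _CDR_PATH.match(path)
    if match is None:
        raise ValueError(f"not a CDR image path: {path!r}")
    volume, *levels = (int(group) for group in match.groups())
    serial = 0
    for level in levels:
        # Each level holds 100 entries, so the levels act as base-100 digits.
        serial = serial * 100 + level
    return CdrImage(volume, serial)
```

Under this reading, “N18/00/00/01/00.tif” is at serial position 100 on volume 18, that is, the 101st image file on that cartridge.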


Figure  7.  Monthly  Printing  Volumes  for  1994  of  the  Federal  Register. 


4.0  IMAGE  VALIDATION  PROCESSING 

Figure 8 illustrates the general process flow that was used to map and validate the CDR images from the
above serial structure, “/Nmm/nn/nn/nn/nn.tif”, into a mapping structure which uses a month, day, and
page number convention. This process translated the filenames, assured image quality, and verified the
correct page was stored. It consisted of Sections 4.1 Image Name Mapping, 4.2 Image Quality Checks,
4.3 Arabic Page Numbered Image Verification, 4.4 Roman Page Numbered Image Verification, and 4.5
Image Sequence Check Verification.


^ Specific hardware and software products identified in this paper were used in order to adequately support the
development of the technology described in this document. In no case does such identification imply
recommendation or endorsement by NIST, nor does it imply that the equipment identified is necessarily the best
available for the purpose.

Figure 8. Top Level Federal Register Validation Process Flow.

4.1  Image  Name  Mapping 

As each image file was read from CDR, its path and filename were mapped into a name consisting of a
two-digit month subdirectory name, a two-digit day subdirectory name, and an eight-character filename
suffixed with a three-character filename extension (“/mm/dd/filename.ext” where mm = 1..12 and dd =
1..31). Daily, the GPO generates Microcomp files, for each Federal Register book, that contain
information required to perform this mapping. We created a mapping index and verified that it was
correct by manually checking it against each Federal Register book.

Each eight-character filename conforms to the convention of “t00nnnnn” where “t” denotes the type of
Federal Register page and “00nnnnn” represents a page number. “nnnnn” is left-padded with zeros.
The file naming conventions were:

□ Prefix page image filenames were assigned “t = a”; “00nnnnn” was reset to zero for each book.
The hard cover page was named “a0000000” and the soft cover page was named “a0000001”.
Each subsequent prefix page was assigned a page number that was sequentially incremented.

□ Body page image filenames were assigned “t = b”; “00nnnnn” was sequentially incremented for
the entire year and is the actual Federal Register page number.

□ Appendix page image filenames were assigned “t = c”; “00nnnnn” was reset for each book. The
initial appendix page was named “c0000001”. Each subsequent appendix page was assigned a
page number that was sequentially incremented.
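
The filename convention above can be expressed as a small formatting routine. This is a hedged Python sketch; `PAGE_TYPES` and `mapped_name` are hypothetical names, the section labels are illustrative, and the dates used when calling it are placeholders rather than values from the mapping index.

```python
# Type letter for each section of a Federal Register book.
PAGE_TYPES = {"prefix": "a", "body": "b", "appendix": "c"}


def mapped_name(month: int, day: int, section: str, page_number: int, ext: str) -> str:
    """Build "/mm/dd/t00nnnnn.ext" from a date, a page type, and a page number."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    if not 1 <= day <= 31:
        raise ValueError(f"day out of range: {day}")
    if not 0 <= page_number <= 99999:
        raise ValueError(f"page number out of range: {page_number}")
    type_letter = PAGE_TYPES[section]
    # Seven digits after the type letter: "00" followed by the five-digit number.
    return f"/{month:02d}/{day:02d}/{type_letter}{page_number:07d}.{ext}"
```

For example, a hard cover page of a book dated January 3 would map to `mapped_name(1, 3, "prefix", 0, "ext")`, which yields “/01/03/a0000000.ext”.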

4.2  Image  Quality  Checks 

Next, each image file was converted from TIFF™ to NIST IHead [9] format. During this conversion,
truncated and/or corrupted bitmaps generated by the scanning process were detectable. If a bad image
bitmap was detected, the image name was flagged and its original Federal Register page was rescanned.
If not, a three-character “pct” extension was assigned and the file was stored on on-line magnetic
storage.

In order to ensure an image file was ready for further verification processing, a series of image quality
checks were performed to ensure that each file conformed to the following characteristics:

□ Resolution  = 15.75  pixels  per  millimeter  (400  PPI) 

□ Compressed file size of CCITT Group 4 [10] image ≥ 30 K Bytes (KB)

□ Width  < 4000  pixels 

□ 4200 pixels < Height ≤ 4900 pixels

An image file that did not conform to these characteristics was flagged and the original Federal Register
page was rescanned.
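
The four checks can be combined into a single predicate. The sketch below is illustrative only: `ImageInfo` and `quality_failures` are hypothetical names, one KB is assumed to be 1024 bytes, and the thresholds are read as "at least 30 KB", "width below 4000 pixels", and "height above 4200 and at most 4900 pixels".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ImageInfo:
    ppi: int               # scanning resolution in pixels per inch
    compressed_bytes: int  # size of the CCITT Group 4 compressed file
    width: int             # bitmap width in pixels
    height: int            # bitmap height in pixels


def quality_failures(info: ImageInfo, bytes_per_kb: int = 1024) -> List[str]:
    """Return the names of the checks an image fails (empty list if it passes)."""
    failures = []
    if info.ppi != 400:
        failures.append("resolution")
    if info.compressed_bytes < 30 * bytes_per_kb:
        failures.append("file size")
    if not info.width < 4000:
        failures.append("width")
    if not 4200 < info.height <= 4900:
        failures.append("height")
    return failures
```

An image with any entry in the returned list would be flagged for rescanning; returning the failed checks rather than a single boolean makes it easier to tally rescans by cause.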

We decided not to store any images of blank Federal Register pages on the output media. All
compressed image files of less than 30 KB were visually inspected, verified to be blank within the
Federal Register paper book, and deleted from on-line magnetic storage. However, the image file sizes of
several blank pages were greater than the 30 KB criterion (due to excessive speckling) and were eliminated
during subsequent validation procedures.

Figure 9 illustrates the rescanning attributable to image quality problems that were detected by the above
quality thresholds. It does not include 1362 image rescans, occurring predominantly in month “7”, due to
an incorrect scanning resolution of 11.81 pixels per millimeter (300 PPI). In total, approximately 1790
pages were rescanned, which represented 2.7% of the total number of pages. All rescanning was
performed in our laboratory using a Fujitsu 3096G^ scanner.
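
The metric resolutions and the rescan percentage quoted above follow from simple arithmetic, sketched here in Python. The function names are hypothetical, and the total of 67,000 pages is the approximate figure given in Section 3.0.

```python
MM_PER_INCH = 25.4


def ppi_to_pixels_per_mm(ppi: float) -> float:
    """Convert a resolution in pixels per inch to pixels per millimeter."""
    return ppi / MM_PER_INCH


def percent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


correct_resolution = round(ppi_to_pixels_per_mm(400), 2)    # 15.75
incorrect_resolution = round(ppi_to_pixels_per_mm(300), 2)  # 11.81
rescan_share = round(percent(1790, 67000), 1)               # 2.7
```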


Figure  9.  Rescans  Due  to  Image  Quality. 

4.3 Arabic Page Numbered Image Verification


Figure 10 illustrates the process steps that validated each body page image of the Federal Register,
“b00nnnnn”. It consisted of Sections 4.3.1 Locate Page Number and Create Subimage, 4.3.2 Optical
Character Recognition of Page Number, 4.3.3 Full-Page Image Adjudication, 4.3.4 Compare OCR
Results to Image Filename, and 4.3.5 Subimage Adjudication using Page Number.


Figure  10.  Arabic  Numbered  Type  of  Page  Processing. 

4.3.1  Locate  Page  Number  and  Create  Subimage 

Figure 5 illustrates the page heading of a Federal Register body page, which contains a page
number field. Depending upon the page face, the page number is printed on either the left edge of
the heading when it is even numbered or the right edge of the heading when it is odd numbered. Each image
was decompressed and a subimage of the top 5 cm (800 pixels) was created from the raster image. The
subimage was spatially reduced by a factor of 3 to increase efficiency, and a skew angle was computed [11].
If the subimage was skewed by more than 0.2 degrees, it was rotated and the skew was removed. Then,
the horizontal header lines, below the text banner, were located and the subimage was truncated to
exclude them.
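
The cropping, reduction, and skew-threshold steps can be sketched as follows. This is a simplified illustration, not the NIST implementation: the bitmap is modeled as a list of pixel rows, the reduction is done by plain subsampling (the paper does not state the reduction method), and the skew estimate itself [11] is left to the caller. All names are hypothetical.

```python
from typing import List

Bitmap = List[List[int]]  # rows of 0/1 pixels, top row first

HEADER_ROWS = 800     # top 5 cm of the page at 400 PPI
REDUCTION = 3         # spatial reduction factor
SKEW_LIMIT_DEG = 0.2  # skew beyond this is removed by rotation


def header_subimage(page: Bitmap, rows: int = HEADER_ROWS) -> Bitmap:
    """Copy the top rows of the page, where the page heading is printed."""
    return [row[:] for row in page[:rows]]


def reduce_by_subsampling(image: Bitmap, factor: int = REDUCTION) -> Bitmap:
    """Keep every factor-th pixel in both directions."""
    return [row[::factor] for row in image[::factor]]


def needs_deskew(angle_deg: float, limit: float = SKEW_LIMIT_DEG) -> bool:
    """True when the measured skew is large enough to warrant rotation."""
    return abs(angle_deg) > limit
```

With these defaults, an 800-row header strip shrinks to 267 rows before the skew angle is estimated, which is where the efficiency gain comes from.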

Isolation of the page number field was attempted by examining the text banner line and locating the page
number field on either the right or left side. If this isolation was unsuccessful, the failing full-page image
was sent to Section 4.3.3 for adjudication. If successful, a subimage file containing the page number was
created and was input to the recognition processing.
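
Because even-numbered pages carry the page number on the left edge of the heading and odd-numbered pages on the right, the expected page number gives a hint about where to look. The sketch below is an assumption about how such a hint could be used; the paper does not say that the isolation software relied on it, and `field_width` is a hypothetical parameter.

```python
from typing import Tuple


def expected_number_side(page_number: int) -> str:
    """Even-numbered pages print the number on the left, odd on the right."""
    return "left" if page_number % 2 == 0 else "right"


def number_field_columns(banner_width: int, page_number: int, field_width: int) -> Tuple[int, int]:
    """Return the (start, end) column range of the banner to search first."""
    if expected_number_side(page_number) == "left":
        return (0, field_width)
    return (banner_width - field_width, banner_width)
```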


Figure 11 provides a graphical analysis of page number isolation failures that include measurements of
blank, section, and skewed pages. Figure 12 illustrates an example of a subimage (Width = 192 pixels
and Height = 53 pixels) that has been enlarged by 150%.


Figure 11. Page Number Isolation Failures. [Chart legend: blank page, section page, skewed page.]


Figure 12. Typical Subimage. [Page number field reading “55269”.]

4.3.2  Optical  Character  Recognition  (OCR)  of  Page  Number 

Since over 95% of the Federal Register consists of body pages, NIST decided to use OCR rather
than human inspection to validate that each candidate image contained the correct Federal Register page,
and that it was in its correct position within the month and day directory structure.

Initially, we examined and tested page number recognition with three commercially available OCR
products that executed in Microsoft Windows/NT and UNIX environments. Accuracy from each of the
three products was determined to be unsatisfactory to NIST because the number of touching characters
contained within the Federal Register yielded a low omni-font recognition accuracy.

As a result of this evaluation, NIST decided to use its own OCR capabilities to recognize the digits
contained within the page number subimage. This involved the following activities:

4.3.2.1 Segmentation

Our previous releases of the NIST public domain Form-Based Handprint Recognition System [9]
(HSFSYS) contained segmentation software that isolated handprinted characters. This software was
modified and adapted to segment 9-point Vermilion page number machine printed digits. Figure 13 and
Figure 14 illustrate typical Federal Register page number fields that contain digits which form touching
characters. Our handprint segmentation software was modified to properly segment machine printed
characters similar to these examples.


Figure  13.  Touching  Characters  caused  by  too  much  ink. 


Figure  14.  Touching  Characters  caused  by  ink  bleeding  through  page. 


4.3.2.2  Classification  and  Training 

HSFSYS Release 1.0 contains software that uses a Probabilistic Neural Network (PNN) [12] to classify a
segmented character. Release 2.0 added software that uses a Multi-Layered Perceptron (MLP) [13]
to classify a segmented character. For this application, a PNN classifier was chosen to classify segmented
characters because it takes less effort to train than an equivalent MLP classifier. The PNN classifier
was trained to recognize 10 digits, 0-9. Due to time constraints, the classifier was trained using only 100
discrete samples per digit class. We know that a larger training set would improve recognition accuracy.
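
For readers unfamiliar with the technique, the following is a minimal textbook-style PNN in Python: each class is scored by the average of Gaussian kernels centered on its training samples, and the highest-scoring class wins. It is not the HSFSYS classifier. The feature vectors, the smoothing parameter `sigma`, and the function name `pnn_classify` are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def pnn_classify(
    x: Sequence[float],
    training: Dict[str, List[Sequence[float]]],
    sigma: float = 0.5,
) -> str:
    """Return the class whose training samples give x the highest kernel density."""
    best_label = None
    best_score = -1.0
    for label, samples in training.items():
        if not samples:
            continue  # a class without samples cannot be scored
        total = 0.0
        for sample in samples:
            # Squared Euclidean distance between x and this training sample.
            dist_sq = sum((a - b) ** 2 for a, b in zip(x, sample))
            total += math.exp(-dist_sq / (2.0 * sigma * sigma))
        score = total / len(samples)
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None:
        raise ValueError("training set contains no samples")
    return best_label
```

"Training" a PNN of this form amounts to storing the samples, which is consistent with the low training effort cited above; the trade-off is that every classification visits every stored sample.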

4.3.2.3  OCR  System  Accuracy 

For 64,384 subimages, our system achieved an overall correct recognition rate of 88.1% on the page
number field. System accuracy includes page number isolation errors as well as typical OCR
segmentation and classification errors. Figure 15 presents a graphical view of each month’s accuracy.
The variance in bleed through, smudged, and lightly printed ink conditions provided an extremely
difficult recognition challenge.


Figure  15.  OCR  System  Accuracy  on  Page  Number  Fields. 

Over 40% of the OCR errors are attributable to improper character segmentation of the page number
subimages; this is an unusually high percentage of failure caused by touching and incomplete/broken
characters. The remaining errors are attributable to improper classifications by our PNN classifier.

Our OCR accuracy varied greatly with the quality of the printed material. The GPO printed the Federal
Register on highly absorbent, recycled newspaper-quality paper, using a printing plate that often
contained either too much ink or too little ink. As illustrated in Figure 16, too much ink and/or bleed
through conditions resulted in failures by our segmentation software to correctly segment three or more
touching characters.

As illustrated in Figure 17, too little ink resulted in missing digit(s) and/or broken character segments.
Although our software correctly segmented and classified a subimage containing missing character(s), the
result was scored as an OCR error in the above accuracy graph because the OCR result did not exactly
match the filename string (Section 4.3.3.3). As illustrated in Figure 18, lightly printed ink at times
caused our segmentation software to cut the non-contiguous sections of characters into multiple characters.
Both of these conditions were major contributors to the number of segmentation errors.
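
The exact-match scoring rule described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the truth string is the body page filename with its type letter and leading zeros removed; the paper does not spell out how the comparison string was formed.

```python
import re


def truth_from_filename(filename: str) -> str:
    """Extract the page number digits from a body page filename such as "b0055269"."""
    match = re.fullmatch(r"b(\d{7})(?:\.\w+)?", filename)
    if match is None:
        raise ValueError(f"not a body page filename: {filename!r}")
    return str(int(match.group(1)))  # drop the zero padding


def is_ocr_correct(ocr_digits: str, filename: str) -> bool:
    """A result counts as correct only if it matches the truth string exactly."""
    return ocr_digits == truth_from_filename(filename)


def accuracy_percent(correct: int, total: int) -> float:
    """System accuracy as the percentage of exactly matching results."""
    return round(100.0 * correct / total, 1)
```

Under this rule a page whose printed number is missing a digit is scored as an error even when every visible digit is classified correctly, which matches the behavior described for Figure 17.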


Figure  16.  Segmentation  Error  caused  by  Touching  Digits. 


Figure 17. OCR Error caused by Missing Printed Digit(s).

Figure 18. Segmentation Error caused by Light Ink Printing.

Figure 19. Classification Error caused by Incomplete Training Set.

Figure 20 illustrates a confusion matrix that was generated by examining all mismatched recognition
result strings, 4269 occurrences, which contained the same number of digits as their associated truth
(filename) strings. Although it does not completely eliminate segmentation errors from the analysis, it
does provide an interesting view of false classifications by our PNN.
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Figure  20.  Confusion  Matrix. 


A number of observations can be drawn from the above confusion matrix. Fifty-five percent of the total
number of errors in the matrix are represented by the following six confusion pairs:

□ “6” classified as an “8” (30%)

□ “0” classified as an “8” (12%)

□ “5” classified as a “6” (3.4%)

□ “5” classified as an “8” (3.3%)

□ “9” classified as an “8” (3.8%)

□ “3” classified as an “8” (2.5%)

Upon visual inspection, it was determined that ink bleed-through was the primary source of error. In
addition, sixteen percent of the total number of errors in the matrix are represented by the following
confusion pairs:

□ “3” classified as a “1” (4.6%)

□ “6” classified as a “4” (11.4%)

These errors are primarily attributed to the limited number of character samples used. Training the PNN
with 100 character samples per class is not sufficient. Recognition accuracy can be improved by training
on a larger number of diversified samples. Figure 19 provides an illustration of this condition; our
character training sets did not include enough distinctive partial characters to classify incomplete
characters.
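
The tabulation behind a confusion matrix of this kind can be sketched as follows. This is a minimal illustration and not the software used in the study: the function name `confusion_counts` and the sample strings are hypothetical, and the sketch counts only the off-diagonal (misclassified) cells, restricted as in the text to mismatched strings whose length equals that of the truth string.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (truth_digit, ocr_digit) confusions.

    `pairs` is an iterable of (truth_string, ocr_string). Exact matches are
    skipped, and so are strings of differing length, since those most likely
    reflect segmentation errors rather than classification errors.
    """
    counts = Counter()
    for truth, hypothesis in pairs:
        if truth == hypothesis or len(truth) != len(hypothesis):
            continue
        for t, h in zip(truth, hypothesis):
            if t != h:
                counts[(t, h)] += 1
    return counts


# Hypothetical sample data, for illustration only.
pairs = [("12276", "12278"), ("406", "408"), ("31", "11"), ("500", "50"), ("77", "77")]
counts = confusion_counts(pairs)
total = sum(counts.values())
for (t, h), n in counts.most_common():
    print(f'"{t}" classified as "{h}": {n} ({100 * n / total:.1f}%)')
```

Restricting the comparison to equal-length strings lets digits be aligned position by position, which is why some segmentation errors can still leak into the counts.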

4.3.3  Full-Page  Image  Adjudication 

Whenever our page number isolation software failed or the visual subimage adjudication was not
successful, full-page image adjudication was performed by an operator using a full-page image reduced
to 2.6 pixels per millimeter (66 PPI). As illustrated in Figure 21, this allowed an operator to
adjudicate blank and non-blank page images.

4.3.3.1 Blank Page Image Adjudication

Whenever our page number isolation software, described in Section 4.3.1, failed to locate and isolate a
page number, full-page images were routed to this process step. An operator viewed each candidate and
classified the image as either a blank or non-blank page.

□ A blank page: Several blank page images reached this step because speckling caused their file
sizes to exceed the 30KB image quality threshold.

□ A non-blank page: A section page within the body of the Federal Register does not contain a
page number header and therefore failed the isolation processing. Extremely skewed pages also
failed this processing. Upon visual inspection, those images that contained more than ten degrees
of skew were rescanned. For each page, an operator confirmed that it was in correct sequence by
checking the Federal Register book. If not, the incorrectly sequenced image was copied to its
correct location and a missing page candidate was flagged.

Blank page and missing page candidates required adjudication using the actual printed Federal Register
page (Section 4.3.3.3).

4.3.3.2  Non-Blank  Page  Image  Adjudication 

Failures were routed to this stage by an operator (1) identifying a non-blank page for which no page
number was isolated or (2) failing to adjudicate a subimage from its information content. At this stage,
an operator categorized whether a page was:

□ Scanned in the wrong sequence.

□ Printed with missing or illegible digit(s).

An incorrectly sequenced image was copied to its correct location and a missing page candidate condition
was flagged. The missing page candidate was adjudicated using the same procedures for adjudication of
non-blank pages described in Section 4.3.3.3.

At times the page number isolation software isolated and created a subimage that an operator could not
validate from the information contained within the subimage. Whenever a page number contained within
a subimage was not readable by an operator, the image was verified to be correct by comparing the
full-page image content with the printed paper page information content. Figure 22 illustrates this
case; the low-order digit is absent from the printed Federal Register page. Only examination of the
printed page content revealed whether or not the image file was the actual scanned paper page.



Figure  22.  Image  Snippet  contained  Unprinted  Low  Order  Page  Number  Digit. 

4.3.3.3  Paper  Adjudication 

Blank pages from Section 4.3.3.1 and missing page candidates from Section 4.3.3.2 required adjudication
by an operator using the paper pages of the corresponding Federal Register book. If an actual paper page
was verified to be blank, the full-page image was deleted from within the “mm/dd” directory hierarchy.
If it was non-blank, the original Federal Register page had been missed during scanning and was scanned.

4.3.4  Compare  OCR  Results  to  Image  Filename 

Characters in each page number subimage were recognized and classified by the NIST OCR software. It
output two files containing recognition results: hypothesis strings and confidence values. For each
character that was segmented and classified, the hypothesis file contained an ASCII character and the
confidence file contained a value that represented the confidence of correct character classification.
The process we used to validate images did not use the recognition confidence values. Instead, we chose
to rely upon an exact match of the ASCII OCR results with the numeric portion of the image filename.

For each page number subimage, its filename string (truncated to exclude leading zeros) was compared
with the OCR results string contained within its associated hypothesis file. If an exact string match
was made, the image was assumed to contain the correct Federal Register page and to be correctly stored
within the “mm/dd” directory hierarchy. If the strings did not match, the file was classified as a
mismatched subimage that required adjudication by a key entry operator.
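
The comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `filename_matches_ocr` is hypothetical, the numeric portion of the filename is assumed to be a zero-padded digit string such as "00012279", and trimming surrounding whitespace from the hypothesis string is an assumption about the file format rather than something the text specifies.

```python
def filename_matches_ocr(numeric_filename, hypothesis):
    """Return True when the OCR hypothesis exactly matches the page number
    encoded in the filename, ignoring the filename's leading zeros."""
    # Strip leading zeros; keep a single "0" if the string was all zeros.
    truth = numeric_filename.lstrip("0") or "0"
    return truth == hypothesis.strip()


# Hypothetical examples: an exact match is accepted automatically, while
# any mismatch would be routed to an operator for adjudication.
print(filename_matches_ocr("00012279", "12279"))
print(filename_matches_ocr("00012279", "1227"))
```

An exact string match is deliberately strict: a missing, extra, or misclassified character all fail the test in the same way, so every doubtful image falls through to human adjudication.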


Figure  23.  Page  Number  Isolation  Error. 


4.3.5  Subimage  Adjudication  using  Page  Number 

At times the page number isolation failed due to excessive skew, too much noise, or text printed too
close to the page number. An example of the latter is illustrated in Figure 23; the word “Federal” abuts
the page number field. The OCR results contained correct character classifications for the page number
digits and erroneous classifications for the non-numeric characters. We could have implemented data
validation software to detect this condition and determine that there was a matching condition;
however, we chose to rely on a human being to adjudicate this type of failure.

Page number subimages were amassed for high-speed human adjudication. Using a high resolution visual
display terminal, an operator was presented with (1) a window that displayed the ASCII image filename
in its title area and a subimage in the display area and (2) another window in which the operator
indicated whether the page number was correct or incorrect.

After viewing the page number subimage and visually reading and comparing it with the filename string
displayed in the title area of its window, an operator responded with a single keystroke of either “y”
or “n”. If an operator accepted the page number, it was assumed that the page was scanned correctly and
the image file was correctly positioned within the “mm/dd” directory hierarchy. If not, the image file
required further adjudication.
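
The single-keystroke workflow can be sketched as a simple loop. This is a hypothetical illustration, not the NIST display software: the function `adjudicate` and the `ask` callback (which stands in for displaying the subimage and reading the operator's keystroke) are assumptions, and re-prompting on any key other than “y” or “n” is a design choice made for the sketch.

```python
def adjudicate(filenames, ask):
    """Sort subimage filenames into accepted and rejected lists.

    `ask(filename)` is assumed to show the subimage alongside its filename
    and return the operator's keystroke as a string.
    """
    accepted, rejected = [], []
    for filename in filenames:
        answer = ask(filename).strip().lower()
        while answer not in ("y", "n"):
            # Any other key is ignored and the operator is asked again.
            answer = ask(filename).strip().lower()
        (accepted if answer == "y" else rejected).append(filename)
    return accepted, rejected


# Scripted keystrokes stand in for a live operator in this example.
keystrokes = iter(["y", "x", "n"])
accepted, rejected = adjudicate(["00012279", "00012280"], lambda name: next(keystrokes))
print("accepted:", accepted)
print("needs further adjudication:", rejected)
```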

4.4  Roman  Page  Numbered  Image  Verification 

Figure 24 illustrates the process steps which validated the prefix and appendix pages, “a00000nn” and
“c00000nn”, of the Federal Register. It consisted of Sections 4.4.1 Separate Cover Pages from Content
Pages and 4.4.2 Cover Page Image Adjudication.

4.4.1 Separate Cover Pages from Content Pages


Since cover page image files (“a0000000” and “a0000001”) do not contain page numbers, these files were
separated and sent to an operator for adjudication.

All non-cover page (“a0000002” through “a00000nn” and “c00000nn”) image files were verified by
using process steps similar to the ones described in Sections 4.3.1 and 4.3.5.

4.4.2 Cover Page Image Adjudication


Each hard cover page (“a0000000”) and soft cover page (“a0000001”) was viewed by an operator at a
high resolution visual display terminal, displaying a full-page image reduced to 2.6 pixels per
millimeter (66 PPI). Since these pages are distinctive (Figure 1 and Figure 2), an operator quickly
confirmed whether or not these pages were correctly stored within the “mm/dd” directory hierarchy. If
not, an out-of-sequence condition was found and the original Federal Register page was rescanned.

4.5  Image  Sequence  Check  Verification 

The final step within this verification process entailed performing an image sequence check that
detected and examined page numbering gaps within the sequence of image files within the “mm/dd”
directory hierarchy. Each gap was either explained or corrected. Numerical gaps were caused by
failures to scan Federal Register page(s), blank pages detected and deleted by the validation
processing, or legitimate page number increments made by the GPO. An operator adjudicated these
conditions by reviewing processing logs and/or verifying the gap condition within the Federal Register.
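
A gap check of this kind can be sketched in a few lines. This is an illustrative sketch only: the function name `find_gaps` is hypothetical, and page numbers are assumed to have already been parsed from the image filenames into integers.

```python
def find_gaps(page_numbers):
    """Return (first_missing, last_missing) ranges in a set of page numbers."""
    pages = sorted(set(page_numbers))
    gaps = []
    for previous, current in zip(pages, pages[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


# Hypothetical page numbers, for illustration only.
gaps = find_gaps([12275, 12276, 12279, 12280, 12282])
for first, last in gaps:
    print(f"missing pages {first} through {last}: explain or correct")
```

The check only reports where numbers are absent; deciding whether a gap is an unscanned page, a deleted blank page, or a legitimate numbering increment still requires the operator review described above.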

5.0  SUMMARY 

We realize that scanning and verifying any voluminous document collection, such as the Federal
Register, is a tedious and expensive process. It cost $8,000 to scan the approximately 67,000
pages of the Federal Register. In order to reduce the verification cost, we decided to minimize human
resource expenditure by exploiting OCR and high-speed visual adjudication by an operator. Usage of
OCR allowed us to automatically validate over 83% of the images and exclude them from adjudication by
a human. Of the remaining 17% of images, over 90% were validated by high-speed operator adjudication;
this minimized expensive paper handling.
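
As a rough check on what these percentages imply, the share of images left after both OCR validation and high-speed operator adjudication can be worked out directly. The figures below are taken from the text; because the text states "over 83%" and "over 90%" and the page count is approximate, the result should be read as an approximate upper bound, not a reported measurement. The variable names are illustrative.

```python
from fractions import Fraction

total_pages = 67_000                      # approximate page count from the text
auto_validated = Fraction(83, 100)        # validated automatically by OCR ("over 83%")
operator_validated = Fraction(90, 100)    # share of the remainder validated on screen ("over 90%")

remaining = 1 - auto_validated                         # 17/100
left_over = remaining * (1 - operator_validated)       # 17/1000
left_over_pages = left_over * total_pages

print(f"at most about {float(left_over):.1%} of images, "
      f"roughly {int(left_over_pages)} pages, needed further handling")
```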

Even though the validation was semi-automated, it was conducted by highly skilled professionals at NIST
and required one person-month of labor costing approximately $35,000. We believe that certain
subjective judgments are best made by technically oriented image processing professionals. If a lesser
labor category were substituted, the cost of validation could be reduced at the expense of quality.